arXiv preprint

LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues

UCLA
5 memory abilities · 451 questions · up to 500 trajectories · up to 115M tokens

Overview

Can memory systems turn long web-agent histories into reusable environment experience?

LongMemEval-V2 evaluates whether memory systems can help agents acquire the experience needed to become knowledgeable colleagues in customized environments. The benchmark pairs manually curated questions with long histories of multimodal web-agent trajectories. Memory systems consume the trajectory history and return compact evidence for downstream question answering. The evaluation measures both answer accuracy and query latency.

Examples of LongMemEval-V2 questions across memory abilities

Benchmark

What memory abilities distinguish an experienced colleague from a stateless web agent?

LongMemEval-V2: Core Memory Abilities

  • Static state recall: remembers important landmarks, page layouts, module affordances, and subtle state differences.
  • Dynamic state tracking: understands how states and actions change the environment over time.
  • Workflow knowledge: knows the steps needed to complete recurring tasks in customized environments.
  • Environment gotchas: recognizes recurring local failure modes and avoids environment-specific traps.
  • Premise awareness: detects assumptions that are valid elsewhere but wrong in the current deployment.

Benchmark construction

1. Collect web-agent trajectories
2. Annotate memory questions
3. Label answer evidence
4. Assemble history haystacks
Question distribution statistics
LME-V2 questions span domains, types, and answer formats.
Haystack trajectory and token statistics
LME-V2 history haystacks are large, while answer-bearing trajectories remain sparse.
| Group | Benchmark | Domain | Sessions | Tokens | MM Context | Questions | MM Questions | Static | Dynamic | Workflow | Gotchas | Premise |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| General long context | LongBench V2 | Mixed | N/A | 260K | No | 503 | No | Yes | No | No | No | No |
| General long context | MemoryAgentBench | Mixed | N/A | 285K | No | 2,071 | No | Yes | No | No | No | No |
| General long context | CL-Bench | Mixed | N/A | 10K | No | 1,899 | No | Yes | No | Yes | No | No |
| Conversational long context | LoCoMo | User-user chat | 28 | ~16K | Yes | 7,512 | No | Yes | Yes | No | No | Yes |
| Conversational long context | LongMemEval-V1 | User-assistant chat | 48-475 | 115K-1.5M | No | 500 | No | Yes | Yes | No | No | Yes |
| Conversational long context | PersonaMem | User-assistant chat | 5-60 | 26K-951K | No | 5,990 | No | Yes | Yes | No | No | No |
| Conversational long context | PersonaMem-v2 | User-assistant chat | 10-20 | 33K-124K | Yes | 5,000 | No | Yes | Yes | No | No | No |
| Conversational long context | BEAM | User-assistant chat | 4.5-100 | 124K-10M | No | 2,000 | No | Yes | Yes | No | No | Yes |
| Agentic long context | MemoryArena | Agent mixed | 7 | 40K+ | No | 766 | No | Yes | No | Yes | Yes | No |
| Agentic long context | AgentLongBench | Game agent | 1 | 31K-4M | No | 6,400 | No | Yes | Yes | No | No | No |
| Agentic long context | EMemBench | Game agent | 1 | 2K-∞ | Yes | 1,280+ | No | Yes | No | Yes | Yes | Yes |
| Agentic long context | FileGramBench | File-system agent | 12 | 11K | Yes | 4,333 | No | Yes | No | Yes | No | No |
| Agentic long context | AMA-Bench | Agent mixed | 1 | 57K | No | 2,496 | No | Yes | Yes | Yes | Yes | No |
| Agentic long context | LongMemEval-V2 | Web agent | 100-498 | 25M-115M | Yes | 451 | Yes | Yes | Yes | Yes | Yes | Yes |

Data Viewer

Questions

Trajectories

Evaluation

LME-V2 evaluates memory systems on context gathering.
1. Insert: stream trajectories
2. Memory: store experience
3. Query: gather evidence
4. Reader: answer from context

A memory system implements the Insert and Query API.

We sequentially insert trajectories, then query for a length-bounded multimodal memory context per question. A fixed reader answers from that context.

Insert(h) → memory · Query(q) → context · Reader(q, context) → answer

The evaluation reports answer accuracy and query latency.
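
To make the contract concrete, here is a minimal sketch of the Insert/Query interface. The class name, word-overlap scoring, and character budget below are illustrative assumptions, not the benchmark's reference implementation; a real system would index, compress, and return multimodal evidence.

```python
from dataclasses import dataclass, field

@dataclass
class NaiveMemory:
    """Toy memory implementing the Insert/Query contract.

    The overlap scoring and character budget are placeholder assumptions;
    real systems index and compress far more aggressively.
    """
    slices: list = field(default_factory=list)

    def insert(self, trajectory):
        """Stream one web-agent trajectory (a list of step strings) into memory."""
        self.slices.extend(trajectory)

    def query(self, question, budget_chars=4000):
        """Return a compact, length-bounded evidence context for the question."""
        q_words = set(question.lower().split())
        ranked = sorted(self.slices,
                        key=lambda s: len(q_words & set(s.lower().split())),
                        reverse=True)
        context, used = [], 0
        for s in ranked:
            if used + len(s) > budget_chars:
                break
            context.append(s)
            used += len(s)
        return "\n".join(context)

# A fixed reader then answers from the gathered context:
#   answer = reader(question, memory.query(question))
```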

AgentRunbook

We design two strong memory baselines: AgentRunbook-R, a RAG pipeline, and AgentRunbook-C, a file-based coding-agent memory controller.

AgentRunbook-R

RAG memory with three knowledge pools for targeted recall; a retrieval sketch follows the list.

  • Raw state slices for fine-grained UI evidence.
  • State transition events for environment dynamics.
  • Procedure and hint notes for workflows and gotchas.
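
A minimal sketch of how records could be routed into the three pools and gathered at query time. The pool names, `retrieve_fn` hook, and per-pool quota are assumptions for illustration, not the released AgentRunbook-R code.

```python
# Pool names mirror the bullet list above; routing and quotas are assumptions.
POOLS = ("state_slices", "transition_events", "procedure_notes")

class ThreePoolRAG:
    def __init__(self, retrieve_fn):
        # retrieve_fn(items, query, k) -> top-k items; plug in BM25, dense, etc.
        self.pools = {p: [] for p in POOLS}
        self.retrieve = retrieve_fn

    def insert(self, record_type, record):
        """Route each record extracted from a trajectory into its pool."""
        self.pools[record_type].append(record)

    def query(self, question, k_per_pool=4):
        """Gather evidence from every pool so UI states, dynamics, and
        workflow/gotcha notes can all surface for a single question."""
        context = []
        for pool in POOLS:
            context.extend(self.retrieve(self.pools[pool], question, k_per_pool))
        return context
```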

AgentRunbook-C

File-based memory that invokes a coding agent to gather evidence; a layout sketch follows below.

  • Trajectory files stored directly on disk.
  • Workflow documents and memory manifests.
  • Helper scripts for state and trajectory inspection.
AgentRunbook memory module overview
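
As a rough illustration of the file-based design, the sketch below writes trajectory files plus a JSON manifest that a coding agent can grep and script over. The directory layout and manifest fields are hypothetical, not the released format.

```python
import json
from pathlib import Path

def build_memory_dir(trajectories, root="runbook_memory"):
    """Write trajectory files plus a manifest a coding agent can inspect.

    `trajectories`: iterable of (task_name, list_of_step_strings).
    All directory names and manifest fields here are hypothetical.
    """
    root = Path(root)
    (root / "trajectories").mkdir(parents=True, exist_ok=True)
    (root / "workflows").mkdir(exist_ok=True)  # distilled workflow documents
    manifest = []
    for i, (task, steps) in enumerate(trajectories):
        path = root / "trajectories" / f"{i:05d}.txt"
        path.write_text("\n".join(steps))
        manifest.append({"id": i, "task": task,
                         "steps": len(steps), "file": str(path)})
    # The manifest is a cheap index the agent reads before grepping raw files.
    (root / "manifest.json").write_text(json.dumps(manifest, indent=2))
```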

Results

AgentRunbook-C currently gives the strongest average accuracy while improving the accuracy-latency frontier.
| Method | Family | Small Overall | Small Latency | Medium Overall | Medium Latency |
|---|---|---|---|---|---|
| No retrieval | Reader only | 1.3% | 0s | 1.3% | 0s |
| RAG: query to slice | RAG | 42.8% | 0.1s | 38.1% | 0.1s |
| RAG: query to slice + notes | RAG | 51.0% | 0.2s | 45.9% | 0.3s |
| AgentRunbook-R | RAG | 58.6% | 26.9s | 57.0% | 25.8s |
| Codex | Coding agent | 69.9% | 177.2s | 68.7% | 185.8s |
| AgentRunbook-C | Coding agent | 74.9% | 108.3s | 70.1% | 139.9s |
Accuracy and latency tradeoff for memory methods
AgentRunbook improves the accuracy-latency frontier for long-term agent memory.

Leaderboard

Goal: accurate and efficient memory

LongMemEval-V2 rewards memory systems that retrieve useful experience without making downstream agents wait. The leaderboard therefore uses a metric called LAFS Gain that captures how much a submitted system improves the accuracy-latency frontier formed by the released baselines and AgentRunbook.

For a set of operating points \(F\), let \(A_F(T)\) be the best accuracy available with query latency at most \(T\). We calculate LAFS as the average reachable accuracy over log-scaled latency budgets:

\[ \mathrm{LAFS}(F) = \frac{1}{\log(T_{\max}/T_{\min})} \int_{T_{\min}}^{T_{\max}} A_F(T)\,\frac{dT}{T}, \qquad A_F(T)=\max\left(\{0\}\cup\{a:(a,t)\in F,\ t\le T\}\right) \]

We use \(T_{\min}=1\mathrm{s}\) and \(T_{\max}=200\mathrm{s}\). With fixed reference frontier \(F_{\mathrm{ref}}=\{\text{RAG slice+notes},\text{Codex},\text{AgentRunbook-R},\text{AgentRunbook-C}\}\), we score a submission \(S\) as:

\[ \Delta\mathrm{LAFS}(S\mid F_{\mathrm{ref}}) = \mathrm{LAFS}(F_{\mathrm{ref}}\cup S)-\mathrm{LAFS}(F_{\mathrm{ref}}) \]

A score of 0 means the method does not beat the released reference frontier at any latency budget.
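
Since \(A_F\) is a step function of the budget \(T\), the log-latency integral reduces to a finite sum over frontier segments. Below is a minimal sketch of LAFS and LAFS Gain; the reference points in the usage lines are the Small-split numbers from the results table above.

```python
import math

def lafs(points, t_min=1.0, t_max=200.0):
    """Average reachable accuracy over log-scaled latency budgets.

    points: iterable of (accuracy, latency) operating points.
    A_F(T) is a step function, so the integral over dT/T is a finite sum of
    best-so-far accuracy times the log-length of each latency segment.
    """
    pts = sorted((t, a) for a, t in points if t <= t_max)
    total, prev_t, best_a = 0.0, t_min, 0.0
    for t, a in pts:
        start = max(t, t_min)
        if start > prev_t:                      # close the previous segment
            total += best_a * math.log(start / prev_t)
            prev_t = start
        best_a = max(best_a, a)                 # frontier improves at t
    total += best_a * math.log(t_max / prev_t)  # final segment up to T_max
    return total / math.log(t_max / t_min)

def lafs_gain(submission, reference):
    """ΔLAFS of one (accuracy, latency) point against a reference frontier."""
    return lafs(reference + [submission]) - lafs(reference)

# Reference frontier from the Small results table (accuracy, latency):
F_REF = [(0.510, 0.2), (0.586, 26.9), (0.699, 177.2), (0.749, 108.3)]
print(f"{lafs_gain((0.70, 10.0), F_REF):+.4f}")  # > 0: beats the frontier somewhere
```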

LME-V2-Small

Leaderboard table (System · Method · Family): entries coming soon.

LME-V2-Medium

Leaderboard table (System · Method · Family): entries coming soon.

Citation

@article{wu2026longmemevalv2,
      title={LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues},
      author={Di Wu and Zixiang Ji and Asmi Kawatkar and Bryan Kwan and Jia-Chen Gu and Nanyun Peng and Kai-Wei Chang},
      year={2026},
      eprint={2605.12493},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.12493},
}